Abstract

We show that the set of binary words containing overlaps is not unambiguously 
context-free and that the set of ternary words containing overlaps is not context-free. 
We also show that the set of binary words that are not subwords of the Thue-Morse 
word is not unambiguously context-free. 

1 Introduction 

An overlap is a word of the form axaxa, where a is a single letter and x is a (possibly empty) 
word. A word is overlap-free if it does not contain an overlap as a subword. Let $L_o(k)$ denote
the language of all words over the alphabet $\{0, 1, \ldots, k-1\}$ that contain an overlap as a
subword. 
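The definition of an overlap can be checked mechanically. The following Python sketch (the function name `contains_overlap` is ours, not the paper's) tests whether a word, given as a string, contains a subword of the form axaxa. It uses the observation that such a subword is exactly a block of length 2p + 1 with period p = |ax|.

```python
def contains_overlap(w: str) -> bool:
    """Return True if w has a subword of the form a x a x a (a a letter, x a word)."""
    n = len(w)
    for p in range(1, n // 2 + 1):        # p = |ax|, the period of the candidate overlap
        run = 0                           # consecutive positions i with w[i] == w[i + p]
        for i in range(n - p):
            run = run + 1 if w[i] == w[i + p] else 0
            if run >= p + 1:              # p + 1 matches cover 2p + 1 letters: an overlap
                return True
    return False


if __name__ == "__main__":
    for word in ["00", "000", "01010", "0110100110010110"]:
        print(word, contains_overlap(word))
```

Membership of a word w in the language of words over {0, ..., k - 1} that contain an overlap is then simply `contains_overlap(w)`.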

By applying the interchange lemma [?], Gabarró [20] proved that for $k \ge 4$, $L_o(k)$
is not context-free, thus partially solving an open problem of Berstel [?]. Since the work
of Gabarró, it has remained an open problem to determine whether or not $L_o(2)$ and $L_o(3)$
are context-free. We show that $L_o(2)$ is not unambiguously context-free and that $L_o(3)$ is
not context-free. We also show that the set of binary words that are not subwords of the 
Thue-Morse word is not unambiguously context-free. 



2 Overlap-free words

In this section we review some standard results concerning binary overlap-free words. 
Let $\mu$ denote the Thue-Morse morphism, that is, the morphism that maps $0 \to 01$ and
$1 \to 10$. The Thue-Morse word

$$t = \mu^{\omega}(0) = 0110100110010110\cdots$$

is well known to be overlap-free [?].
Let

$$A = \{00, 11, 010010, 101101\}.$$

Pansiot [?] and Brlek [?] proved that the set of squares in t is exactly

$$\bigcup_{k \ge 0} \mu^k(A).$$

Using this result, one easily shows (see, for example, [13]) that for any position i, there is at
most one square in t beginning at position i.
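The two facts about squares can be checked experimentally on a finite prefix of t. The sketch below is only an illustration under our own naming (`mu`, `thue_morse_prefix`, `squares_by_position`, `mu_images` are hypothetical helpers); it inspects the prefix of length 2^9 and is of course no substitute for the cited proofs.

```python
def mu(w: str) -> str:
    """Thue-Morse morphism: 0 -> 01, 1 -> 10."""
    return "".join("01" if c == "0" else "10" for c in w)


def thue_morse_prefix(k: int) -> str:
    """Return mu^k(0), the prefix of t of length 2**k."""
    w = "0"
    for _ in range(k):
        w = mu(w)
    return w


def squares_by_position(w: str) -> dict[int, list[str]]:
    """Map each start position i to the squares yy (y nonempty) that begin at i in w."""
    found: dict[int, list[str]] = {}
    for i in range(len(w)):
        for p in range(1, (len(w) - i) // 2 + 1):
            if w[i:i + p] == w[i + p:i + 2 * p]:
                found.setdefault(i, []).append(w[i:i + 2 * p])
    return found


def mu_images(words: set[str], max_len: int) -> set[str]:
    """All words mu^k(a) with a in words, k >= 0, of length at most max_len."""
    images: set[str] = set()
    layer = set(words)
    while layer:
        layer = {w for w in layer if len(w) <= max_len}
        images |= layer
        layer = {mu(w) for w in layer}
    return images


A = {"00", "11", "010010", "101101"}
t = thue_morse_prefix(9)
by_pos = squares_by_position(t)
allowed = mu_images(A, len(t))

# Every square found in the prefix should be an image mu^k(a) with a in A,
# and no position should start more than one square.
only_expected_squares = all(s in allowed for sq in by_pos.values() for s in sq)
at_most_one_per_position = all(len(sq) == 1 for sq in by_pos.values())
```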

3 Binary words containing overlaps 

Theorem 1. The language $L_o(2)$ is not unambiguously context-free.

Proof. We will need the following result due to Fatou [?] (a more convenient reference may
be [?, Part VIII, Chap. 3, No. 167]; a stronger result was conjectured by Pólya and proved
by Carlson [2]): A power series $\sum_{n \ge 0} a_n X^n$ with integer coefficients and radius of convergence
1 is either rational or transcendental over Q(X).
Let

$$F(X) = \sum_{n \ge 0} a_n X^n$$

be the generating series of the overlap-free words. That is, $a_n$ is the number of overlap-free
words of length n over a two letter alphabet. By the Chomsky-Schützenberger Theorem
(see [?, Chap. 16] for a proof; see also, for example, [?] for applications to other
languages), if $L_o(2)$ is unambiguously context-free, then its generating series
$\sum_{n \ge 0} (2^n - a_n) X^n = 1/(1 - 2X) - F(X)$ is algebraic over Q(X), and hence so is F(X). To
prove the theorem it suffices then to show that F(X) is transcendental.

We will need the following result due to Lepistö [?] on the enumeration of overlap-free
words (compare also the earlier work of Restivo and Salemi [23], Kfoury [22], Kobayashi [23],
and Cassaigne [?]):

$$a_n = \Omega(n^{1.217}) \quad \text{and} \quad a_n = O(n^{1.369}). \tag{1}$$

Since $a_n = O(n^{1.369})$, F(X), as a complex power series, has radius of convergence 1, and
so by Fatou's theorem is either rational or transcendental over Q(X). To complete the proof
we must show that F(X) is not rational. If F(X) were rational, then the coefficients $a_n$
could be written in the form

$$a_n = \sum_{i=1}^{m} A_i(n)\, \alpha_i^n,$$

for some m, where $\alpha_i$ is a characteristic root of multiplicity $r_i$ of the linear recurrence satisfied
by $(a_n)_{n \ge 0}$, and $A_i(X)$ is a polynomial of degree at most $r_i - 1$ (see [?, Section 1.1.6]). But
(1) places the growth of $a_n$ strictly between $n$ and $n^2$, and from this we see that such an
expression is not possible, so F(X) is not rational and the proof is complete. □

We conclude this section by considering a variation on the language $L_o(k)$. Given a word
w, and writing w = xy, we say the word yx is a conjugate of w. Let $L_{co}(k)$ denote the
language of all words w over the alphabet $\{0, 1, \ldots, k-1\}$ such that some conjugate of w
contains an overlap as a subword.

Theorem 2. The language $L_{co}(2)$ is not unambiguously context-free.

Proof. Harju [21] showed that the binary circular overlap-free words have lengths of the form
$2^n$ or $3 \cdot 2^n$, $n \ge 0$. The generating series of the complement of $L_{co}(2)$ is thus a so-called
"gap series" (or "lacunary series"). By Hadamard's gap theorem [?, Theorem 16.6] it admits
its circle of convergence as a natural boundary and hence is transcendental. Applying the
Chomsky-Schützenberger Theorem, we conclude that $L_{co}(2)$ is not unambiguously context-free. □
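The length restriction used in this proof can be observed by exhaustive search for short words. The following sketch (our own helper names; a brute-force illustration, not Harju's argument) lists the lengths n for which some binary word of length n has no conjugate containing an overlap.

```python
from itertools import product


def contains_overlap(w: str) -> bool:
    """Return True if w has a subword of the form a x a x a."""
    n = len(w)
    for p in range(1, n // 2 + 1):
        run = 0
        for i in range(n - p):
            run = run + 1 if w[i] == w[i + p] else 0
            if run >= p + 1:
                return True
    return False


def is_circularly_overlap_free(w: str) -> bool:
    """True if no conjugate yx of w = xy contains an overlap."""
    return not any(contains_overlap(w[i:] + w[:i]) for i in range(len(w)))


def circular_overlap_free_lengths(max_n: int) -> list[int]:
    """Lengths n <= max_n admitting a circularly overlap-free binary word."""
    return [
        n
        for n in range(1, max_n + 1)
        if any(is_circularly_overlap_free("".join(bits)) for bits in product("01", repeat=n))
    ]


if __name__ == "__main__":
    print(circular_overlap_free_lengths(12))
```

The search enumerates all 2^n words of each length, so it is practical only for small bounds.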



4 Ternary words containing overlaps 

In this section we adapt the argument of Gabarró [20] to prove the following theorem.

Theorem 3. The language $L_o(3)$ is not context-free.

Before beginning the proof, we recall the interchange lemma [?].

Theorem 4 (Ogden, Ross, and Winklmann). Let $L \subseteq \Sigma^*$ be a context-free language.
There exists a constant c, depending only on L, such that for all $n \ge 2$, all subsets $R \subseteq L \cap \Sigma^n$,
and all m, $2 \le m \le n$, there exists a subset $Z \subseteq R$, $Z = \{z_1, z_2, \ldots, z_k\}$, such that

(a) $k \ge \frac{|R|}{c(n+1)^2}$;

(b) $z_i = w_i x_i y_i$, $1 \le i \le k$;

(c) $|w_1| = |w_2| = \cdots = |w_k|$;

(d) $|y_1| = |y_2| = \cdots = |y_k|$;

(e) $m/2 < |x_1| = |x_2| = \cdots = |x_k| \le m$;

(f) $w_i x_j y_i \in L$, $1 \le i, j \le k$.

Proof of Theorem 3. Let $n = 2^{2k+1} + 1$ for some $k > 0$. Let $x = \mu^{2k}(0)$ and let w = 0xx.
Then w is an overlap, but no proper subword of w is an overlap. To see this, note that xx is
a subword of the Thue-Morse word and is therefore overlap-free. Any overlap contained in
w must therefore begin from the first position of w. If w begins with two distinct overlaps,
then xx begins with two distinct squares, contradicting the observation made in Section 2.

Suppose that $L_o(3)$ is context-free. Let $\psi$ be the morphism defined by $\psi(0) = 0$ and
$\psi(1) = \psi(2) = 1$. Define

$$R = \{0yy : y \in \psi^{-1}(x)\}.$$

Note that $|R| = 2^{(n-1)/4}$. Applying the interchange lemma, we see that there exists $Z \subseteq R$
with

$$|Z| \ge \frac{2^{(n-1)/4}}{c(n+1)^2}. \tag{2}$$

Choosing $m = (n-1)/2$, and recalling that if $z_i = w_i x_i y_i \in Z$, then $m/2 < |x_i| \le m$, we
see that $w_i x_j y_i \in L_o(3)$ only if $x_i = x_j$. Fixing $x_i$, we easily verify that there are at most $2^{(n-1)/8}$
words $w_j x_j y_j$ with $x_j = x_i$, so that $|Z| \le 2^{(n-1)/8}$, contradicting (2) for n sufficiently large.
This concludes the proof. □
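The objects used in this proof are easy to build for a small parameter. The sketch below takes k = 1 as an illustrative choice (so the Thue-Morse block is 0110 and n = 9) and uses our own helper names. It constructs w = 0xx and the set R of words 0yy, where y ranges over the ternary preimages of x under the morphism sending 0 to 0 and both 1 and 2 to 1, and then checks the size of R.

```python
from itertools import product


def mu(w: str) -> str:
    """Thue-Morse morphism: 0 -> 01, 1 -> 10."""
    return "".join("01" if c == "0" else "10" for c in w)


def contains_overlap(w: str) -> bool:
    """Return True if w has a subword of the form a x a x a."""
    n = len(w)
    for p in range(1, n // 2 + 1):
        run = 0
        for i in range(n - p):
            run = run + 1 if w[i] == w[i + p] else 0
            if run >= p + 1:
                return True
    return False


def preimages(x: str) -> list[str]:
    """All ternary words y whose image is x when 0 -> 0 and 1, 2 -> 1."""
    choices = [("0",) if c == "0" else ("1", "2") for c in x]
    return ["".join(letters) for letters in product(*choices)]


k = 1                                  # illustrative choice of the parameter k
x = "0"
for _ in range(2 * k):                 # x = mu^(2k)(0)
    x = mu(x)
n = 2 ** (2 * k + 1) + 1               # common length of w and of the words in R
w = "0" + x + x
R = ["0" + y + y for y in preimages(x)]

if __name__ == "__main__":
    print(n, w, R)
```

Each 1 in x contributes a free binary choice, and x contains (n - 1)/4 ones, which is where the count of R comes from.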



5 Generalized Thue-Morse words

In this section we show that the set of binary words that are not subwords of the Thue-Morse 
word t is not unambiguously context-free. We also show that this result holds for generalized 
Thue-Morse words as well. 

For an infinite word w, let $p_w(n)$ denote the subword complexity function of w. That is,
the value of $p_w(n)$ is equal to the number of subwords of length n that occur in w. Let $L_w$
denote the set of words over the alphabet of w that are not subwords of w.

Brlek [?] and de Luca and Varricchio [?] (see also the subsequent work of Avgustinovich
[?], Tapsoba [?], Frid [?], and Tromp and Shallit [?]) determined that

$$p_t(n+1) = \begin{cases}
2 & \text{if } n = 0, \\
4 & \text{if } n = 1, \\
4n - 2^a & \text{if } n = 2^a + b, \text{ where } a \ge 1,\ 0 \le b < 2^{a-1}, \\
4n - 2^a - 2b & \text{if } n = 2^a + 2^{a-1} + b, \text{ where } a \ge 1,\ 0 \le b < 2^{a-1}.
\end{cases}$$
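The characterization above can be compared with a direct count on a long prefix of t. The sketch below assumes that the four cases give the value of p_t(n + 1), an indexing inferred from the formula for generalized Thue-Morse words later in this section, and it assumes that a prefix of length 4096 already contains every subword of length at most 16. The function names are ours.

```python
def thue_morse(length: int) -> str:
    """Prefix of t, using t(n) = (sum of binary digits of n) mod 2."""
    return "".join(str(bin(i).count("1") % 2) for i in range(length))


def complexity_brute_force(w: str, m: int) -> int:
    """Number of distinct subwords of length m occurring in the finite word w."""
    return len({w[i:i + m] for i in range(len(w) - m + 1)})


def complexity_formula(m: int) -> int:
    """p_t(m) for m >= 1, reading the piecewise formula as a formula for p_t(n + 1)."""
    n = m - 1
    if n == 0:
        return 2
    if n == 1:
        return 4
    a = n.bit_length() - 1             # largest a with 2**a <= n
    r = n - 2 ** a
    half = 2 ** (a - 1)
    if r < half:                       # n = 2^a + b with 0 <= b < 2^(a-1)
        return 4 * n - 2 ** a
    b = r - half                       # n = 2^a + 2^(a-1) + b with 0 <= b < 2^(a-1)
    return 4 * n - 2 ** a - 2 * b


if __name__ == "__main__":
    t = thue_morse(4096)
    for m in range(1, 17):
        print(m, complexity_brute_force(t, m), complexity_formula(m))
```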

Based on this characterization, we prove the following theorem. 
Theorem 5. The language $L_t$ is not unambiguously context-free.
Proof. Let

$$F(X) = \sum_{n \ge 1} p_t(n) X^n$$

be the generating series of the subwords of the Thue-Morse word. We show that F(X) is
transcendental over Q(X). Suppose to the contrary that F(X) is algebraic. Then the series

$$G(X) = \sum_{n \ge 1} (p_t(n+1) - p_t(n)) X^n,$$
whose coefficients form the sequence of first differences of $p_t(n)$, is also algebraic. Note that
for all $n \ge 1$,

$$p_t(n+1) - p_t(n) \le 4,$$

so that the coefficients of G(X) are bounded. Applying Fatou's Theorem to G(X), we see
that G(X) is either rational or transcendental. By assumption, G(X) is algebraic, so it must
be rational. But then the sequence

$$\Delta = (p_t(n+1) - p_t(n))_{n \ge 1},$$

being a bounded integer sequence with a rational generating series,
is ultimately periodic. We easily verify from the above formula for $p_t(n+1)$ that this is not the case: for instance,
$\Delta$ contains arbitrarily large "runs" of 4's. This contradiction implies the transcendence of
F(X).

Alternatively, one may note that the series H(X), whose coefficients form the sequence
of second differences of $p_t(n)$, is a gap series, and one may therefore apply Hadamard's gap
theorem to H(X).

The desired result follows by applying the Chomsky-Schützenberger Theorem. □
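The failure of ultimate periodicity can be made concrete by listing the runs of 4's in the first-difference sequence. The sketch below evaluates the complexity of t from the piecewise formula, again read as a formula for p_t(n + 1) (an indexing assumption on our part), and records the lengths of the maximal runs of 4's. The helper names are ours.

```python
from itertools import groupby


def p_t(m: int) -> int:
    """Subword complexity of t at length m >= 1, from the piecewise formula."""
    n = m - 1
    if n == 0:
        return 2
    if n == 1:
        return 4
    a = n.bit_length() - 1             # largest a with 2**a <= n
    r = n - 2 ** a
    half = 2 ** (a - 1)
    if r < half:
        return 4 * n - 2 ** a
    return 4 * n - 2 ** a - 2 * (r - half)


def first_differences(count: int) -> list[int]:
    """The terms p_t(n + 1) - p_t(n) for n = 1, ..., count."""
    return [p_t(n + 1) - p_t(n) for n in range(1, count + 1)]


def run_lengths(seq: list[int], value: int) -> list[int]:
    """Lengths of the maximal runs of `value` in seq, in order of appearance."""
    return [len(list(group)) for v, group in groupby(seq) if v == value]


if __name__ == "__main__":
    delta = first_differences(60)
    print(delta[:12])
    print(run_lengths(delta, 4))
```

On this initial segment the runs of 4's double in length each time, which is incompatible with any eventual period.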

Note: The use of analytic techniques in the proof of Theorem 5 may be avoided by applying
instead the theorems of Christol [?] and Cobham [12]. See the paper of Allouche [2] for
some examples of this approach.

Next we consider generalizations of the Thue-Morse word. Let $s_2(n)$ denote the sum of
the digits in the base-2 expansion of n. It is well known that the Thue-Morse word $t =
t(0)t(1)t(2)\cdots$ can be defined by $t(n) = s_2(n) \bmod 2$. For $k \ge 2$, we define the generalized
Thue-Morse word $t_k$ by $t_k(n) = s_2(n) \bmod k$, so that $t = t_2$. Tromp and Shallit
characterized the subword complexity of these words as follows:



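$$p_{t_k}(n+1) = \begin{cases}
k & \text{if } n = 0, \\
k^2 & \text{if } n = 1, \\
k(kn - 2^{a-1}) & \text{if } n = 2^a + b, \text{ where } a \ge 1,\ 0 \le b < 2^{a-1}, \\
k(kn - 2^{a-1} - b) & \text{if } n = 2^a + 2^{a-1} + b, \text{ where } a \ge 1,\ 0 \le b < 2^{a-1}.
\end{cases}$$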



One therefore proves the following result in a manner entirely analogous to that of Theorem 5.
Theorem 6. For $k \ge 2$, the language $L_{t_k}$ is not unambiguously context-free.
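The generalized Thue-Morse words are straightforward to generate from the digit-sum definition, which makes it possible to experiment with the complexity formula. The sketch below uses our own function names and represents a word as a list of integers. It makes no claim about how long a prefix must be for the brute-force count to be exact when k > 2.

```python
def generalized_thue_morse(k: int, length: int) -> list[int]:
    """Prefix of t_k, where t_k(n) = (sum of binary digits of n) mod k."""
    return [bin(n).count("1") % k for n in range(length)]


def complexity_brute_force(seq: list[int], m: int) -> int:
    """Number of distinct subwords of length m occurring in the finite word seq."""
    return len({tuple(seq[i:i + m]) for i in range(len(seq) - m + 1)})


def complexity_formula(k: int, m: int) -> int:
    """Value given by the piecewise formula for the complexity of t_k at length m >= 1."""
    n = m - 1
    if n == 0:
        return k
    if n == 1:
        return k * k
    a = n.bit_length() - 1             # largest a with 2**a <= n
    r = n - 2 ** a
    half = 2 ** (a - 1)
    if r < half:                       # n = 2^a + b with 0 <= b < 2^(a-1)
        return k * (k * n - half)
    return k * (k * n - half - (r - half))   # n = 2^a + 2^(a-1) + b


if __name__ == "__main__":
    prefix = generalized_thue_morse(3, 1 << 14)
    for m in range(1, 9):
        print(m, complexity_brute_force(prefix, m), complexity_formula(3, m))
```

For k = 2 the formula reduces to the one given earlier for t, which provides a simple consistency check.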



6 Discussion and future work 

To complete the work discussed here, it remains to determine whether or not the languages 
$L_o(2)$, $L_{co}(2)$, and $L_{t_k}$ are context-free. We discuss some related issues below.

Mossé [?] and Frid [?] have written several papers showing that a large class
of words generated by iterating morphisms have subword complexity functions that behave
similarly to that of the Thue-Morse word; i.e., they are piecewise linear on exponentially
growing intervals. The first difference sequence of such subword complexity functions is
therefore either constant or not ultimately periodic. If it were possible to characterize those
words for which the latter situation occurs, one might generalize the argument of Theorem 5
to a larger class of words.

One may also apply this argument in cases where the subword complexity function is 
not linear. As an example, we may consider the generating series of the paperfolding words. 
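Allouche and Bousquet-Mélou [3] have shown that the number f(n) of subwords of length n
of the paperfolding words is given by

$$f(n) = \begin{cases}
\ldots & \text{if } 1 \le n \le 3, \\
\ldots & \text{if } 2^a \le n < 2^a + 2^{a-1}, \text{ where } a \ge 2, \\
\ldots & \text{if } 2^a + 2^{a-1} \le n < 2^a + 2^{a-1} + 2^{a-2}, \text{ where } a \ge 2, \\
\ldots & \text{if } 2^a + 2^{a-1} + 2^{a-2} \le n < 2^{a+1}, \text{ where } a \ge 2,
\end{cases}$$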

and they have shown that the corresponding generating function F(X) is transcendental. 
We may also deduce the transcendence of F(X) by considering the series H(X), whose 
coefficients form the second difference sequence of f(n). Noting that H(X) is a gap series, 
we may apply Hadamard's gap theorem to derive the desired result. 

Noting that F(X), along with the other generating functions considered earlier, has 
coefficients that are polynomially bounded, we take this opportunity to mention a remarkable 
recent result of D'Alessandro, Intrigila, and Varricchio [?]: If a context-free language has only
polynomially many words of length n, then its generating function is rational. Applying this
result to the paperfolding words, for instance, one recovers a result of Lehr [?], namely,
that the set of subwords of the paperfolding words is not context-free. 